Abstract

We consider finite horizon Markov decision processes under performance measures that involve both the mean
and the variance of the cumulative reward. We show that either randomized or history-based policies can improve
performance. We prove that computing a policy that maximizes the mean reward under a variance
constraint is NP-hard in some cases, and strongly NP-hard in others. We finally offer pseudopolynomial exact
and approximation algorithms.
I. Introduction

The classical theory of Markov decision processes (MDPs) deals with the maximization of the expected value
of the cumulative (possibly discounted) reward, to be denoted by $W$. However, a risk-averse decision maker
may be interested in additional distributional properties of $W$. In this paper, we focus on the case where
the decision maker is interested in both the mean and the variance of the cumulative reward, and we
explore the associated computational issues.

Risk aversion in MDPs is of course an old subject. In one approach, the focus is on the maximization
of $E[U(W)]$, where $U$ is a concave utility function. Problems of this type can be handled by state
augmentation (e.g., Bertsekas, 1995), namely, by introducing an auxiliary state variable that keeps track
of the cumulative past reward. In a few special cases, e.g., with an exponential utility function, state
augmentation is unnecessary, and optimal policies can be found by solving a modified Bellman equation
(Chung & Sobel, 1987). Another interesting case where optimal policies can be found efficiently involves
piecewise linear utility functions with a single break point; see Liu and Koenig (2005).

In another approach, the objective is to optimize a so-called coherent risk measure (Artzner, Delbaen, Eber, & Heath,
1999), which turns out to be equivalent to a robust optimization problem: one assumes a family of
probabilistic models and optimizes the worst-case performance over this family. In the multistage case
(Riedel, 2004), problems of this type can be difficult (Le Tallec, 2007), except for some special cases
(Iyengar, 2005; Nilim & El Ghaoui, 2005) that can be reduced to Markov games (Shapley, 1953).



Mean-variance optimization lacks some of the desirable properties of approaches involving coherent
risk measures and sometimes leads to counterintuitive policies. Bellman's principle of optimality does
not hold, and as a consequence, a decision maker who has received unexpectedly large rewards in the
first stages may actively seek to incur losses in subsequent stages in order to keep the variance small.
Nevertheless, mean-variance optimization is an important approach in financial decision making (e.g.,
Luenberger, 1997), especially for static (one-stage) problems. Consider, for example, a fund manager who
is interested in the 1-year performance of the fund, as measured by the mean and variance of the return.
Assuming that the manager is allowed to undertake periodic re-balancing actions in the course of the year, 
one obtains a Markov decision process with mean-variance criteria. Mean-variance optimization can also 
be a meaningful objective in various engineering contexts. Consider, for example, an engineering process 
whereby a certain material is deposited on a surface. Suppose that the primary objective is to maximize 
the amount deposited, but that there is also an interest in having all manufactured components be similar 
to each other; this secondary objective can be addressed by keeping the variance of the amount deposited 
small. 

We note that expressions for the variance of the discounted reward for stationary policies were developed
in Sobel (1982). However, these expressions are quadratic in the underlying transition probabilities, and
do not lead to convex optimization problems.

Motivated by considerations such as the above, this paper deals with the computational complexity
aspects of mean-variance optimization. The problem is not straightforward for various reasons. One is the
absence of a principle of optimality that could lead to simple recursive algorithms. Another reason is that,
as is evident from the formula $\mathrm{Var}(W) = E[W^2] - (E[W])^2$, the variance is not a linear function of the
probability measure of the underlying process. Nevertheless, $E[W^2]$ and $E[W]$ are linear functions, and as
such can be addressed simultaneously using methods from multicriteria or constrained Markov decision
processes (Altman, 1999). Indeed, we will use such an approach in order to develop pseudopolynomial
exact or approximation algorithms. On the other hand, we will also obtain various NP-hardness results,
which show that there is little hope for significant improvement of our algorithms.

The rest of the paper is organized as follows. In Section II we describe the model and our notation. We
also define various classes of policies and performance objectives of interest. In Section III we compare
different policy classes and show that performance typically improves strictly as more general policies are
allowed. In Section IV we establish NP-hardness results for the policy classes we have introduced. Then, in
Sections V and VI we develop exact and approximate pseudopolynomial time algorithms. Unfortunately,
such algorithms do not seem possible for some of the more restricted classes of policies, due to strong
NP-completeness results established in Section IV. Finally, Section VII contains some brief concluding
remarks.

II. The Model 

In this section, we define the model, notation, and performance objectives that we will be studying.
Throughout, we focus on finite horizon problems.[1]

A. Markov Decision Processes 

We consider a Markov decision process (MDP) with finite state, action, and reward spaces. An MDP
is formally defined by a sextuple $\mathcal{M} = (T, \mathcal{S}, \mathcal{A}, \mathcal{R}, p, g)$ where:

(a) $T$, a positive integer, is the time horizon;

(b) $\mathcal{S}$ is a finite collection of states, one of which is designated as the initial state;

(c) $\mathcal{A}$ is a collection of finite sets of possible actions, one set for each state;

(d) $\mathcal{R}$ is a finite subset of $\mathbb{Q}$ (the set of rational numbers), and is the set of possible values of the
immediate rewards. We let $K = \max_{r \in \mathcal{R}} |r|$;

(e) $p : \{0, \ldots, T-1\} \times \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \mathbb{Q}$ describes the transition probabilities. In particular, $p_t(s' \mid s, a)$
is the probability that the state at time $t+1$ is $s'$, given that the state at time $t$ is $s$, and that action
$a$ is chosen at time $t$;

(f) $g : \{0, \ldots, T-1\} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to \mathbb{Q}$ is a set of reward distributions. In particular, $g_t(r \mid s, a)$ is
the probability that the immediate reward at time $t$ is $r$, given that the state and action at time $t$ are
$s$ and $a$, respectively.

With few exceptions (e.g., for the time horizon $T$), we use capital letters to denote random variables, and
lower case letters to denote ordinary variables. The process starts at the designated initial state. At every
stage $t = 0, 1, \ldots, T-1$, the decision maker observes the current state $S_t$ and chooses an action $A_t$.
Then, an immediate reward $R_t$ is obtained, distributed according to $g_t(\cdot \mid S_t, A_t)$, and the next state $S_{t+1}$
is chosen, according to $p_t(\cdot \mid S_t, A_t)$. Note that we have assumed that the possible values of the immediate
reward and the various probabilities are all rational numbers. This is in order to address the computational
complexity of various problems within the standard framework of digital computation. Finally, we will
use the notation $x_{0:t}$ to indicate the tuple $(x_0, \ldots, x_t)$.

[1] Some of the results, such as the approximation algorithms of Section VI, can be extended to the infinite horizon
discounted case; this is beyond the scope of this paper.
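The dynamics just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout for `p` and `g`, the helper names `sample` and `simulate_episode`, and the conditional independence of the reward and the next state given the current state and action are all choices made here, not part of the model definition.

```python
import random

def sample(dist, rng):
    """Draw one outcome from a dict {outcome: probability}."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]

def simulate_episode(T, s0, p, g, policy, rng):
    """Sample one trajectory of a finite horizon MDP.

    p[t][(s, a)] is a dict {next_state: probability}, i.e. p_t(. | s, a).
    g[t][(s, a)] is a dict {reward: probability}, i.e. g_t(. | s, a).
    policy(t, s, w) returns the action A_t (hypothetical interface).
    Returns the list of (S_t, A_t, R_t) triples and the cumulative reward W_T.
    """
    s, w = s0, 0
    history = []
    for t in range(T):
        a = policy(t, s, w)                 # choose A_t
        r = sample(g[t][(s, a)], rng)       # R_t ~ g_t(. | S_t, A_t)
        s_next = sample(p[t][(s, a)], rng)  # S_{t+1} ~ p_t(. | S_t, A_t)
        history.append((s, a, r))
        s, w = s_next, w + r                # W_{t+1} = W_t + R_t
    return history, w
```

Passing the cumulative reward `w` to the policy makes the same routine usable for policies that depend on the augmented state introduced in the next subsection.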

B. Policies 

We will use the symbol $\pi$ to denote policies. Under a deterministic policy $\pi = (\mu_0, \ldots, \mu_{T-1})$, the
action at each time $t$ is determined according to a mapping $\mu_t$ whose argument is the history $H_t =
(S_{0:t}, A_{0:t-1}, R_{0:t-1})$ of the process, by letting $A_t = \mu_t(H_t)$. We let $\Pi_h$ be the set of all such history-based
policies. (The subscripts are used as a mnemonic for the variables on which the action is allowed to
depend.) We will also consider randomized policies. For this purpose, we assume that there is available
a sequence of i.i.d. uniform random variables $U_0, U_1, \ldots, U_{T-1}$, which are independent from everything
else. In a randomized policy, the action at time $t$ is determined by letting $A_t = \mu_t(H_t, U_{0:t})$. Let $\Pi_{h,u}$ be
the set of all randomized policies.

In classical MDPs, it is well known that restricting to Markovian policies (policies that take into account
only the current state $S_t$) results in no loss of performance. In our setting, there are two different possible
"states" of interest: the original state $S_t$, or the augmented state $(S_t, W_t)$, where

$$W_t = \sum_{k=0}^{t-1} R_k$$

(with the convention that $W_0 = 0$). Accordingly, we define the following classes of policies: $\Pi_{t,s}$ (under
which $A_t = \mu_t(S_t)$), and $\Pi_{t,s,w}$ (under which $A_t = \mu_t(S_t, W_t)$), and their randomized counterparts $\Pi_{t,s,u}$
(under which $A_t = \mu_t(S_t, U_t)$), and $\Pi_{t,s,w,u}$ (under which $A_t = \mu_t(S_t, W_t, U_t)$). Notice that

$$\Pi_{t,s} \subset \Pi_{t,s,w} \subset \Pi_h,$$

and similarly for their randomized counterparts.

C. Performance Criteria 

Once a policy $\pi$ and an initial state $s$ are fixed, the cumulative reward $W_T$ becomes a well-defined random
variable. The performance measures of interest are its mean and variance, defined by $J_\pi = E_\pi[W_T]$ and
$V_\pi = \mathrm{Var}_\pi(W_T)$, respectively. Under our assumptions (finite horizon and bounded rewards), it follows
that there are finite upper bounds of $KT$ and $K^2 T^2$, for $|J_\pi|$ and $V_\pi$, respectively, independent of the
policy.

Given our interest in complexity results, we will focus on "decision" problems that admit a yes/no
answer, except for Section VI. We define the following problem.

Problem MV-MDP($\Pi$): Given an MDP $\mathcal{M}$ and rational numbers $\lambda$, $v$, does there exist a policy $\pi$ in the set
$\Pi$ such that $J_\pi \ge \lambda$ and $V_\pi \le v$?

Clearly, an algorithm for the problem MV-MDP($\Pi$) can be combined with binary search to solve (up to
any desired precision) the problem of maximizing the expected value of $W_T$ subject to an upper bound
on its variance, or the problem of minimizing the variance of $W_T$ subject to a lower bound on its mean.
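The binary search mentioned above can be sketched as follows. The function name `max_mean_under_variance` and the callable `oracle(lam, v)` are hypothetical: the oracle stands for any procedure that answers the decision problem (is there a policy with mean at least `lam` and variance at most `v`?), which the paper does not provide in code form. The search range uses the bound that the mean always lies in $[-KT, KT]$.

```python
def max_mean_under_variance(oracle, v, K, T, tol=1e-6):
    """Approximate the largest mean achievable with variance at most v.

    oracle(lam, v) -> bool is an assumed yes/no solver for the decision
    problem. Returns None if no policy meets the variance bound at all.
    """
    lo, hi = -K * T, K * T
    # Every policy has mean >= -K*T, so this call only tests the variance bound.
    if not oracle(lo, v):
        return None
    # Invariant: oracle(lo, v) is True; answers are monotone in lam.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(mid, v):
            lo = mid
        else:
            hi = mid
    return lo
```

The number of oracle calls grows like $\log(KT/\mathrm{tol})$, so the cost is dominated by the decision procedure itself. The symmetric problem (minimizing the variance subject to a lower bound on the mean) can be handled the same way by searching over `v` in $[0, K^2T^2]$.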

III. Comparison of Policy Classes 

Our first step is to compare the performance obtained from different policy classes. We introduce some
terminology. Let $\Pi$ and $\Pi'$ be two policy classes. We say that $\Pi$ is inferior to $\Pi'$ if, loosely speaking, the
policy class $\Pi'$ can always match or exceed the "performance" of policy class $\Pi$, and for some instances
it can exceed it strictly. Formally, $\Pi$ is inferior to $\Pi'$ if the following hold: (i) if $(\mathcal{M}, c, d)$ is a "yes"
instance of MV-MDP($\Pi$), then it is also a "yes" instance of MV-MDP($\Pi'$); (ii) there exists some $(\mathcal{M}, c, d)$
which is a "no" instance of MV-MDP($\Pi$) but a "yes" instance of MV-MDP($\Pi'$). Similarly, we say that two
policy classes $\Pi$ and $\Pi'$ are equivalent if every "yes" (respectively, "no") instance of MV-MDP($\Pi$) is a
"yes" (respectively, "no") instance of MV-MDP($\Pi'$).

We define one more convenient term. A state $s$ is said to be terminal if it is absorbing (i.e., $p_t(s \mid s, a) =
1$, for every $t$ and $a$) and provides zero rewards (i.e., $g_t(0 \mid s, a) = 1$, for every $t$ and $a$).

A. Randomization Improves Performance 

Our first observation is that randomization can improve performance. This is not surprising given that
we are dealing simultaneously with two criteria, and that randomization is helpful in constrained MDPs
(e.g., Altman, 1999).



Theorem 1. (a) $\Pi_{t,s}$ is inferior to $\Pi_{t,s,u}$;

(b) $\Pi_{t,s,w}$ is inferior to $\Pi_{t,s,w,u}$;

(c) $\Pi_h$ is inferior to $\Pi_{h,u}$.
Proof. It is clear that performance cannot deteriorate when randomization is allowed. It therefore suffices
to display an instance in which randomization improves performance.

Consider a one-stage MDP ($T = 1$). At time 0, we are at the initial state and there are two available
actions, $a$ and $b$. The mean and variance of the resulting reward are both zero under action $a$, and both
equal to 1 under action $b$. After the decision is made, the rewards are obtained and the process terminates.
Thus $W_T = R_0$, the reward obtained at time 0.

Consider the problem of maximizing $E[R_0]$ subject to the constraint that $\mathrm{Var}(R_0) \le 1/2$. There is
only one feasible deterministic policy (choose action $a$), and it has zero expected reward. On the other
hand, a randomized policy that chooses action $b$ with probability $p$ has an expected reward of $p$ and the
corresponding variance satisfies

$$\mathrm{Var}(R_0) \le E[R_0^2] = p\, E[R_0^2 \mid A_0 = b] = 2p.$$

When $0 < p \le 1/4$, such a randomized policy is feasible and improves upon the deterministic one.

Note that for the above instance we have $\Pi_{t,s} = \Pi_{t,s,w} = \Pi_h$, and $\Pi_{t,s,u} = \Pi_{t,s,w,u} = \Pi_{h,u}$. Hence the
above example establishes all three of the claimed statements. q.e.d.
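The arithmetic in this example can be checked exactly. The proof only specifies the mean and variance of the reward under action $b$; the sketch below assumes, purely for illustration, that action $b$ pays 0 or 2 with equal probability (which has mean 1 and variance 1). The function names are likewise illustrative.

```python
from fractions import Fraction

def mean_and_variance(dist):
    """Mean and variance of a finite distribution {value: probability}."""
    mean = sum(pr * r for r, pr in dist.items())
    second_moment = sum(pr * r * r for r, pr in dist.items())
    return mean, second_moment - mean * mean

def randomized_reward_dist(p):
    """Distribution of R_0 when action b is chosen with probability p.

    Assumption: action a pays 0; action b pays 0 or 2, each w.p. 1/2.
    """
    p = Fraction(p)
    return {0: (1 - p) + p / 2, 2: p / 2}
```

Under this assumed distribution the exact variance is $2p - p^2$, which is consistent with the bound $2p$ used in the proof; at $p = 1/4$ the mean is $1/4$ and the variance is $7/16 \le 1/2$.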

B. Information Improves Performance 

We now show that in most cases, performance can improve strictly when we allow a policy to have
access to more information. The only exception arises for the pair of classes $\Pi_{t,s,w,u}$ and $\Pi_{h,u}$, which we
show in Section V to be equivalent (cf. Theorem 6).

Theorem 2. (a) $\Pi_{t,s}$ is inferior to $\Pi_{t,s,w}$, and $\Pi_{t,s,u}$ is inferior to $\Pi_{t,s,w,u}$.
(b) $\Pi_{t,s,w}$ is inferior to $\Pi_h$.

Proof.

(a) Consider the following MDP, with time horizon $T = 2$. The process starts at the initial state $s_0$, at
which there are two actions. Under action $a_1$, the immediate reward is zero and the process moves
to a terminal state. Under action $a_2$, the immediate reward $R_0$ is either 0 or 1, with equal probability,
and the process moves to state $s_1$. At state $s_1$, there are two actions, $a_3$ and $a_4$: under action $a_3$, the
immediate reward $R_1$ is equal to 0, and under action $a_4$, it is equal to 1. We are interested in the
optimal value of the expected reward $E[W_2] = E[R_0 + R_1]$, subject to the constraint that the variance
is less than or equal to zero (and therefore equal to zero). Let $p$ be the probability that action $a_2$ is
chosen at state $s_0$. If $p > 0$, and under any policy in $\Pi_{t,s,u}$, the reward $R_0$ at state $s_0$ has positive
variance, and the reward $R_1$ at the next stage is uncorrelated with $R_0$. Hence, the variance of $R_0 + R_1$
is positive, and such a policy is not feasible; in particular, the constraint on the variance requires
that $p = 0$. We conclude that the largest possible expected reward under any policy in $\Pi_{t,s,u}$ (and, a
fortiori, under any policy in $\Pi_{t,s}$) is equal to zero.

Consider now the following policy, which belongs to $\Pi_{t,s,w}$ and, a fortiori, to $\Pi_{t,s,w,u}$: at state $s_0$,
choose action $a_2$; then, at state $s_1$, choose $a_3$ if $W_1 = R_0 = 1$, and choose $a_4$ if $W_1 = R_0 = 0$. In
either case, the total reward is $R_0 + R_1 = 1$, while the variance of $R_0 + R_1$ is zero, thus ensuring
feasibility. This establishes the first part of the theorem.

(b) Consider the following MDP, with time horizon $T = 3$. At state $s_0$ there is only one available action;
the next state $S_1$ is either $s_1$ or $s_1'$, with probability $p$ and $1 - p$, respectively, and the immediate reward
$R_0$ is zero. At either state $s_1$ or $s_1'$, there is again only one available action; the next state, $S_2$, is $s_2$,
and the reward $R_1$ is zero. At state $s_2$, there are two actions, $a$ and $b$. Under action $a$, the mean and
variance of the resulting reward $R_2$ are both zero, and under action $b$, they are both equal to 1. Let
us examine the largest possible value of $E[W_3] = E[R_2]$, subject to the constraint $\mathrm{Var}(W_3) \le 1/2$.
The class $\Pi_{t,s,w}$ contains two policies, corresponding to the two deterministic choices of an action
at state $s_2$; only one of them is feasible (the one that chooses action $a$), resulting in zero expected
reward. However, the following policy in $\Pi_h$ has positive expected reward: choose action $b$ at state
$s_2$ if and only if the state at time 1 was equal to $s_1$ (which happens with probability $p$). As long as $p$
is sufficiently small, the constraint $\mathrm{Var}(W_3) \le 1/2$ is met, and this policy is feasible. It follows that
$\Pi_{t,s,w}$ is inferior to $\Pi_h$. q.e.d.
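The instance in part (a) is small enough to enumerate. The sketch below (with illustrative function names, and restricted to deterministic policies, so it does not cover the randomized classes treated in the proof) computes the exact distribution of $W_2$ for every deterministic policy and reports the best mean among zero-variance policies, with and without access to $W_1$.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def total_reward_dist(first, second):
    """Distribution of W_2 = R_0 + R_1 in the instance of Theorem 2(a).

    first: 'a1' or 'a2' (action at s_0).
    second: dict mapping W_1 in {0, 1} to 'a3' (R_1 = 0) or 'a4' (R_1 = 1).
    """
    if first == "a1":
        return {0: Fraction(1)}          # zero reward, then a terminal state
    dist = {}
    for r0 in (0, 1):                    # R_0 is 0 or 1 with equal probability
        r1 = 0 if second[r0] == "a3" else 1
        dist[r0 + r1] = dist.get(r0 + r1, Fraction(0)) + HALF
    return dist

def mean_var(dist):
    mean = sum(pr * w for w, pr in dist.items())
    second_moment = sum(pr * w * w for w, pr in dist.items())
    return mean, second_moment - mean * mean

def best_zero_variance_mean(use_w):
    """Best mean over deterministic zero-variance policies.

    use_w=False forces the action at s_1 to ignore W_1 (class Pi_{t,s});
    use_w=True lets it depend on W_1 (class Pi_{t,s,w}).
    """
    best = None
    for first, c0, c1 in product(("a1", "a2"), ("a3", "a4"), ("a3", "a4")):
        if not use_w and c0 != c1:
            continue
        mean, var = mean_var(total_reward_dist(first, {0: c0, 1: c1}))
        if var == 0 and (best is None or mean > best):
            best = mean
    return best
```

Calling `best_zero_variance_mean(False)` and `best_zero_variance_mean(True)` reproduces the gap claimed in the proof between the two deterministic classes.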

IV. Complexity Results 

In this section, we establish that mean-variance optimization in finite horizon MDPs is unlikely to admit 
polynomial time algorithms, in contrast to classical MDPs. 

Theorem 3. The problem MV-MDP($\Pi$) is NP-hard, when $\Pi$ is $\Pi_{t,s,w}$, $\Pi_{t,s,w,u}$, $\Pi_h$, or $\Pi_{h,u}$.

Proof. We will actually show NP-hardness for the special case of MV-MDP($\Pi$), in which we wish to
determine whether there exists a policy whose reward variance is equal to zero. (In terms of the problem
definition, this corresponds to letting $\lambda = -KT$ and $v = 0$.) The proof uses a reduction from the
SUBSET SUM problem: Given $n$ positive integers $r_1, \ldots, r_n$, does there exist a subset $B$ of $\{1, \ldots, n\}$ such that

$$\sum_{i \in B} r_i = \sum_{i \notin B} r_i\,?$$

Given an instance $(r_1, \ldots, r_n)$ of SUBSET SUM, and for any of the policy classes of interest, we construct
an instance of MV-MDP($\Pi$), with time horizon $T = n + 1$, as follows. At the initial state $s_0$, there is only
one available action, resulting in zero immediate reward ($R_0 = 0$). With probability 1/2, the process moves
to a terminal state; with probability 1/2, the process moves (deterministically) along a sequence of states
$s_1, \ldots, s_n$. At each state $s_i$ ($i = 1, \ldots, n$), there are two actions: $a_i$, which results in an immediate reward
of $r_i$, and $b_i$, which results in an immediate reward of $-r_i$.

Suppose that there exists a set $B \subset \{1, \ldots, n\}$ such that $\sum_{i \in B} r_i = \sum_{i \notin B} r_i$. Consider the policy that
chooses action $a_i$ at state $s_i$ if and only if $i \in B$. This policy achieves zero total reward, with probability
1, and therefore meets the zero variance constraint. Conversely, if a policy results in zero variance, then
the total reward must be equal to zero, with probability 1 (because the terminal branch yields zero reward
with probability 1/2), which implies that such a set $B$ exists. This completes the reduction.

Note that this argument applies no matter which particular class of policies is being considered. q.e.d.
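The equivalence used in the reduction can be illustrated by brute force on small instances. The sketch below is exponential in $n$ and serves only as a sanity check; the function names are illustrative. Along the chain $s_1, \ldots, s_n$ a policy is summarized by a sign pattern, and zero variance requires the signed sum to vanish.

```python
from itertools import product

def zero_variance_policy_exists(r):
    """Brute-force search over action choices in the MDP built from r.

    The terminal branch yields total reward 0 with probability 1/2, so the
    variance is zero exactly when the chain branch also totals 0, i.e. when
    some choice of signs (+r_i for a_i, -r_i for b_i) sums to zero.
    """
    for signs in product((1, -1), repeat=len(r)):
        if sum(sg * ri for sg, ri in zip(signs, r)) == 0:
            return True
    return False

def balanced_subset_exists(r):
    """Direct check of the SUBSET SUM question: is there B whose sum equals
    the sum of the remaining elements?"""
    total, n = sum(r), len(r)
    return any(
        2 * sum(r[i] for i in range(n) if (mask >> i) & 1) == total
        for mask in range(1 << n)
    )
```

The two functions agree on every input, which is precisely the content of the reduction.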

The above proof also applies to the policy classes $\Pi_{t,s}$ and $\Pi_{t,s,u}$. However, for these two classes,
a stronger result is possible. Recall that a problem is strongly NP-hard if it remains NP-hard when
restricted to instances in which the numerical part of the instance description involves "small" numbers;
see Garey and Johnson (1979) for a precise definition.

Theorem 4. If $\Pi$ is either $\Pi_{t,s}$ or $\Pi_{t,s,u}$, the problem MV-MDP($\Pi$) is strongly NP-hard.

Proof. As in the proof of Theorem 3, we will prove the result for the special case of MV-MDP($\Pi$), in which
we wish to determine whether there exists a policy under which the variance of the reward is equal
to zero. The proof involves a reduction from the 3-Satisfiability problem (3SAT). An instance of 3SAT
consists of $n$ Boolean variables $x_1, \ldots, x_n$, and $m$ clauses $C_1, \ldots, C_m$, with three literals per clause. Each
clause is the disjunction of three literals, where a literal is either a variable or its negation. (For example,
$x_2 \vee \bar{x}_4 \vee x_5$ is such a clause, where a bar stands for negation.) The question is whether there exists an
assignment of truth values ("true" or "false") to the variables such that all clauses are satisfied.

Suppose that we are given an instance of 3SAT, with $n$ variables and $m$ clauses, $C_1, \ldots, C_m$. We
construct an instance of MV-MDP($\Pi$) as follows. There is an initial state $s_0$, a state $d_0$, a state $c_j$ associated
with each clause $C_j$, and a state $y_i$ associated with each literal $x_i$. The actions, dynamics, and rewards
are as follows:

(a) Out of state $s_0$, there is equal probability, $1/(m+1)$, of reaching any one of the states $d_0, c_1, \ldots, c_m$,
independent of the action; the immediate reward is zero.

(b) State $d_0$ is a terminal state. At each state $c_j$, there are three actions available: each action selects one
of the three literals in the clause, and the process moves to the state $y_i$ associated with that literal;
the immediate reward is 1 if the literal appears in the clause unnegated, and $-1$ if the literal appears
in the clause negated. For example, suppose that the clause is of the form $x_2 \vee \bar{x}_4 \vee x_5$. Under
the first action, the next state is $y_2$, and the reward is 1; under the second action, the next state is $y_4$
and the reward is $-1$; under the third action, the next state is $y_5$, and the reward is 1.

(c) At each state $y_i$, there are two possible actions $a_i$ and $b_i$, resulting in immediate rewards of 1 and
$-1$, respectively. The process then moves to the terminal state $d_0$.

Suppose that we have a "yes" instance of 3SAT, and consider a truth assignment that satisfies all clauses.
We can then construct a policy in $\Pi_{t,s}$ (and a fortiori in $\Pi_{t,s,u}$), whose total reward is zero (and therefore
has zero variance) as follows. If $x_i$ is set to be true (respectively, false), we choose action $b_i$ (respectively,
$a_i$) at state $y_i$. At state $c_j$ we choose an action associated with a literal that makes the clause true.
Suppose that state $c_j$ is visited after the first transition, i.e., $S_1 = c_j$. If the literal associated with the
selected action at $c_j$ is unnegated, e.g., the literal $x_i$, then the immediate reward is 1. Since this literal
makes the clause true, it follows that the action chosen at the subsequent state, $y_i$, is $b_i$, resulting in
a reward of $-1$, and a total reward of zero. The argument for the case where the literal associated with
the selected action at state $c_j$ is negated is similar. It follows that the total reward is zero, with probability
1.

For the converse direction, suppose that there exists a policy in $\Pi_{t,s}$, or more generally, in $\Pi_{t,s,u}$, under
which the variance of the total reward is zero. Since the total reward is equal to zero whenever the first
transition leads to state $d_0$ (which happens with probability $1/(m+1)$), it follows that the total reward
must always be zero. Consider now the following truth assignment: $x_i$ is set to be true if and only if the
policy chooses action $b_i$ at state $y_i$, with positive probability. Suppose that the state visited after the first
transition is $c_j$. Suppose that the action chosen at state $c_j$ leads next to state $y_i$ and that the literal $x_i$
appears unnegated in clause $C_j$. Then, the reward at state $c_j$ is 1, which implies that the reward at state $y_i$
is $-1$. It follows that the action chosen at $y_i$ is $b_i$, and therefore $x_i$ has been set to be true. It follows that
clause $C_j$ is satisfied. A similar argument shows that clause $C_j$ is satisfied when the literal $x_i$ associated
with the chosen action at $c_j$ appears negated. In either case, we conclude that clause $C_j$ is satisfied. Since
every state $c_j$ is possible at time 1, it follows that every clause is satisfied, and we have a "yes" instance
of 3SAT. q.e.d.
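The forward direction of this reduction can be traced mechanically. In the sketch below, the encoding of clauses as tuples of signed integers (`+i` for $x_i$, `-i` for its negation) and the function names are assumptions made for illustration. Given a satisfying assignment, the code builds the policy described in the proof and lists the total reward on each branch $c_1, \ldots, c_m$.

```python
def policy_from_assignment(clauses, assignment):
    """Build the policy used in the forward direction of the reduction.

    clauses: list of 3-tuples of nonzero ints (+i for x_i, -i for not x_i).
    assignment: dict {i: bool} assumed to satisfy every clause; if it does
    not, next() raises StopIteration.
    Returns (literal selected at each c_j, action chosen at each y_i).
    """
    # At y_i choose b_i (reward -1) if x_i is true, a_i (reward +1) otherwise.
    action_at_y = {i: ("b" if val else "a") for i, val in assignment.items()}
    # At c_j select any literal that is true under the assignment.
    choice_at_clause = [
        next(lit for lit in clause if assignment[abs(lit)] == (lit > 0))
        for clause in clauses
    ]
    return choice_at_clause, action_at_y

def total_rewards(choice_at_clause, action_at_y):
    """Total reward on each branch c_j (the branch to d_0 always yields 0)."""
    totals = []
    for lit in choice_at_clause:
        r1 = 1 if lit > 0 else -1                        # reward at c_j
        r2 = 1 if action_at_y[abs(lit)] == "a" else -1   # reward at y_i
        totals.append(r1 + r2)
    return totals
```

Every branch total is zero for a satisfying assignment, so the resulting policy has zero variance, as the proof asserts. All rewards in the construction lie in $\{-1, 0, 1\}$, which is what makes the hardness result strong.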

V. Exact Algorithms 

The comparison and complexity results of the preceding two sections indicate that the policy classes
$\Pi_{t,s}$, $\Pi_{t,s,w}$, $\Pi_{t,s,u}$, and $\Pi_h$ are inferior to the class $\Pi_{h,u}$, and furthermore some of them ($\Pi_{t,s}$, $\Pi_{t,s,u}$)
appear to have higher complexity. Thus, there is no reason to consider them further. While the problem
MV-MDP($\Pi_{h,u}$) is NP-hard, there is still a possibility for approximate or pseudopolynomial time algorithms.
In this section, we focus on exact pseudopolynomial time algorithms.

Our approach involves an augmented state, defined by $X_t = (S_t, W_t)$. Let $\mathcal{X}$ be the set of all possible
values of the augmented state. Let $|\mathcal{S}|$ be the cardinality of the set $\mathcal{S}$. Let $|\mathcal{R}|$ be the cardinality of the
set $\mathcal{R}$. Recall also that $K = \max_{r \in \mathcal{R}} |r|$. If we assume that the immediate rewards are integers, then $W_t$
is an integer between $-KT$ and $KT$. In this case, the cardinality $|\mathcal{X}|$ of the augmented state space $\mathcal{X}$ is
bounded by $|\mathcal{S}| \cdot (2KT + 1)$, which is polynomial in $K$ and the instance size. Without the integrality
assumption, the cardinality of the set $\mathcal{X}$ remains finite, but it can increase exponentially with $T$. For this
reason, we study the integer case separately in Section V-B.

A. State-Action Frequencies 

In this section, we provide some results on the representation of MDPs in terms of a state-action 
frequency polytope, thus setting the stage for our subsequent algorithms. 

For any policy $\pi \in \Pi_{h,u}$, and any $x \in \mathcal{X}$, $a \in \mathcal{A}$, we define the state-action frequencies at time $t$ by

$$z_t^\pi(x, a) = P_\pi(X_t = x, A_t = a), \qquad t = 0, 1, \ldots, T-1,$$

and

$$z_t^\pi(x) = P_\pi(X_t = x), \qquad t = 0, 1, \ldots, T.$$

Let $z^\pi$ be a vector that lists all of the above defined state-action frequencies.
For any family $\Pi$ of policies, let $Z(\Pi) = \{z^\pi \mid \pi \in \Pi\}$. The following result is well known (e.g.,
Altman, 1999). It asserts that any feasible state-action frequency vector can be attained by policies that
depend only on time, the (augmented) state, and a randomization variable. Furthermore, the set of feasible
state-action frequency vectors is a polyhedron, hence amenable to linear programming methods.

Theorem 5. (a) We have $Z(\Pi_{h,u}) = Z(\Pi_{t,s,w,u})$.

(b) The set $Z(\Pi_{h,u})$ is a polyhedron, specified by $O(T \cdot |\mathcal{X}| \cdot |\mathcal{A}|)$ linear constraints.

Note that a certain mean-variance pair $(\lambda, v)$ is attainable by a policy in $\Pi_{h,u}$ if and only if there exists
some $z \in Z(\Pi_{h,u})$ that satisfies

$$\sum_{(s,w) \in \mathcal{X}} w\, z_T(s, w) = \lambda, \qquad (1)$$

$$\sum_{(s,w) \in \mathcal{X}} w^2\, z_T(s, w) = v + \lambda^2. \qquad (2)$$

Furthermore, since $Z(\Pi_{h,u}) = Z(\Pi_{t,s,w,u})$, it follows that if a pair $(\lambda, v)$ is attainable by a policy in $\Pi_{h,u}$,
it is also attainable by a policy in $\Pi_{t,s,w,u}$. This establishes the following result.

Theorem 6. The policy classes $\Pi_{h,u}$ and $\Pi_{t,s,w,u}$ are equivalent.

Note that checking the feasibility of the conditions $z \in Z(\Pi_{h,u})$, (1), and (2) amounts to solving a linear
programming problem, with a number of constraints proportional to the cardinality of the augmented state
space $\mathcal{X}$ and, therefore, in general, exponential in $T$.
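To make Eqs. (1) and (2) concrete, the following sketch computes the terminal frequencies $z_T(s, w)$ of a given policy in $\Pi_{t,s,w,u}$ by a forward recursion over the augmented state, and then reads off the mean and variance. It evaluates a fixed policy and does not solve the linear program. The data layout, the function names, and the conditional independence of $R_t$ and $S_{t+1}$ given $(S_t, A_t)$ are assumptions made for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

def terminal_frequencies(T, s0, p, g, policy):
    """Return z_T as a dict {(s, w): P(S_T = s, W_T = w)}.

    p[t][(s, a)] and g[t][(s, a)] are dicts {outcome: probability};
    policy(t, s, w) is a dict {action: probability}.
    Assumes R_t and S_{t+1} are independent given (S_t, A_t).
    """
    z = {(s0, 0): Fraction(1)}
    for t in range(T):
        z_next = defaultdict(Fraction)
        for (s, w), mass in z.items():
            for a, pa in policy(t, s, w).items():
                for r, pr in g[t][(s, a)].items():
                    for s2, ps in p[t][(s, a)].items():
                        z_next[(s2, w + r)] += mass * pa * pr * ps
        z = dict(z_next)
    return z

def mean_and_variance(z_T):
    lam = sum(w * m for (_, w), m in z_T.items())     # left side of Eq. (1)
    q = sum(w * w * m for (_, w), m in z_T.items())   # left side of Eq. (2)
    return lam, q - lam * lam                         # v = q - lam^2
```

With integer rewards, the dictionary `z` never holds more than $|\mathcal{S}| \cdot (2KT + 1)$ entries, which is the source of the pseudopolynomial bounds discussed next.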

B. Integer Rewards 

In this section, we assume that the immediate rewards are integers, with absolute value bounded by
$K$, and we show that pseudopolynomial time algorithms are possible. Recall that an algorithm is a
pseudopolynomial time algorithm if its running time is polynomial in $K$ and the instance size. (This is
in contrast to polynomial time algorithms in which the running time can only grow as a polynomial of
$\log K$.)

Theorem 7. Suppose that the immediate rewards are integers, with absolute value bounded by $K$. Consider
the following two problems:

(i) determine whether there exists a policy in $\Pi_{h,u}$ for which $(J_\pi, V_\pi) = (\lambda, v)$, where $\lambda$ and $v$ are given
rational numbers; and,

(ii) determine whether there exists a policy in $\Pi_{h,u}$ for which $J_\pi = \lambda$ and $V_\pi \le v$, where $\lambda$ and $v$ are
given rational numbers.

Then,

(a) these two problems admit a pseudopolynomial time algorithm; and,

(b) unless P=NP, these problems cannot be solved in polynomial time.

Proof.

(a) As already discussed, these problems amount to solving a linear program. In the integer case, the
number of variables and constraints is bounded by a polynomial in $K$ and the instance size. The
result follows because linear programming can be solved in polynomial time.

(b) This is proved by considering the special case where $\lambda = v = 0$ and the exact same argument as in
the proof of Theorem 3. q.e.d.

Similar to constrained MDPs, mean-variance optimization involves two different performance criteria.
Unfortunately, however, the linear programming approach to constrained MDPs does not translate into an
algorithm for the problem MV-MDP($\Pi_{h,u}$). The reason is that the set

$$P_{MV} = \{(J_\pi, V_\pi) \mid \pi \in \Pi_{h,u}\}$$

of achievable mean-variance pairs need not be convex. To bring the constrained MDP methodology to
bear on our problem, instead of focusing on the pair $(J_\pi, V_\pi)$, we define $Q_\pi = E_\pi[W_T^2]$, and focus on the
pair $(J_\pi, Q_\pi)$. This is now a pair of objectives that depend linearly on the state frequencies associated
with the final augmented state $X_T$. Accordingly, we define

$$P_{MQ} = \{(J_\pi, Q_\pi) \mid \pi \in \Pi_{h,u}\}.$$

Note that $P_{MQ}$ is a polyhedron, because it is the image of the polyhedron $Z(\Pi_{h,u})$ under the linear
mapping specified by the left-hand sides of Eqs. (1)-(2). In contrast, $P_{MV}$ is the image of $P_{MQ}$ under a
nonlinear mapping:

$$P_{MV} = \{(\lambda, q - \lambda^2) \mid (\lambda, q) \in P_{MQ}\},$$

and is not, in general, a polyhedron.
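As a worked illustration of this non-convexity, consider again the one-stage instance from the proof of Theorem 1, where action $a$ gives $(J, Q) = (0, 0)$ and action $b$ gives $(J, Q) = (1, 2)$ (mean 1 and variance 1 imply a second moment of 2). A policy that chooses $b$ with probability $p$ then attains the following pairs; this is a sketch for that single instance, not a general statement.

```latex
\begin{align*}
(J_\pi, Q_\pi) &= (1-p)\,(0, 0) + p\,(1, 2) = (p,\; 2p), \\
V_\pi &= Q_\pi - J_\pi^2 = 2p - p^2, \qquad 0 \le p \le 1, \\
P_{MQ} &= \{(p,\, 2p) \mid 0 \le p \le 1\}, \qquad
P_{MV} = \{(p,\, 2p - p^2) \mid 0 \le p \le 1\}.
\end{align*}
```

Here $P_{MQ}$ is a line segment, whereas $P_{MV}$ is a strictly concave arc: the points $(0, 0)$ and $(1, 1)$ belong to $P_{MV}$, but their midpoint $(1/2, 1/2)$ does not, since the only policy with mean $1/2$ has variance $3/4$.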

As a corollary of the above discussion, and for the case of integer rewards, we can exploit convexity
to devise pseudopolynomial algorithms for problems that can be formulated in terms of the convex
set $P_{MQ}$. On the other hand, because of the non-convexity of $P_{MV}$, we have not been able to devise
pseudopolynomial time algorithms for the problem MV-MDP($\Pi_{h,u}$), or even the simpler problem of deciding
whether there exists a policy $\pi \in \Pi_{h,u}$ that satisfies $V_\pi \le v$, for some given number $v$, except for the
very special case where $v = 0$, which is the subject of our next result. For a general $v$, an approximation
algorithm will be presented in the next section.

Theorem 8. (a) If there exists some $\pi \in \Pi_{h,u}$ for which $V_\pi = 0$, then there exists some $\pi' \in \Pi_{t,s,w}$ for
which $V_{\pi'} = 0$.

(b) Suppose that the immediate rewards are integers, with absolute value bounded by $K$. Then the problem
of determining whether there exists a policy $\pi \in \Pi_{h,u}$ for which $V_\pi = 0$ admits a pseudopolynomial
time algorithm.

Proof.

(a) Suppose that there exists some $\pi \in \Pi_{h,u}$ for which $V_\pi = 0$. By Theorem 6, $\pi$ can be assumed,
without loss of generality, to lie in $\Pi_{t,s,w,u}$. Let $\mathrm{Var}_\pi(W_T \mid U_{0:T})$ be the conditional variance of
$W_T$, conditioned on the realization of the randomization variables $U_{0:T}$. We have $\mathrm{Var}_\pi(W_T) \ge
E_\pi[\mathrm{Var}_\pi(W_T \mid U_{0:T})]$. Since the left-hand side is zero and conditional variances are nonnegative,
this implies that there exists some $u_{0:T}$ such that $\mathrm{Var}_\pi(W_T \mid U_{0:T} = u_{0:T}) = 0$.
By fixing the randomization variables to this particular $u_{0:T}$, we obtain a deterministic policy, in $\Pi_{t,s,w}$,
under which the reward variance is zero.

(b) If there exists a policy under which $V_\pi = 0$, then there exists an integer $k$, with $|k| \le KT$, such that,
under this policy, $W_T$ is guaranteed to be equal to $k$. Thus, we only need to check, for each $k$ in the
relevant range, whether there exists a policy such that $(J_\pi, V_\pi) = (k, 0)$. By Theorem 7, this can be
done in pseudopolynomial time. q.e.d.

The approach in the proof of part (b) above leads to a short argument, but yields a rather inefficient
(albeit pseudopolynomial) algorithm. A much more efficient and simple algorithm is obtained by realizing
that the question of whether $W_T$ can be forced to be $k$, with probability 1, is just a reachability game:
the decision maker picks the actions and an adversary picks the ensuing transitions and rewards (among
those that have positive probability of occurring). The decision maker wins the game if it can guarantee
that $W_T = k$. Such sequential games are easy to solve in time polynomial in the number of (augmented)
states, decisions, and the time horizon, by a straightforward backward recursion. On the other hand, a
genuinely polynomial time algorithm does not appear to be possible; indeed, the proof of Theorem 3
shows that the problem is NP-complete.
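The reachability game can be sketched as a memoized recursion over augmented states $(t, s, w)$, which is a top-down form of the backward recursion mentioned above. The data layout and the function name `can_force_total` are assumptions made for this illustration; states and actions must be hashable, and rewards are taken to be integers.

```python
from functools import lru_cache

def can_force_total(T, s0, actions, p, g, k):
    """Decide whether some policy guarantees W_T = k with probability 1.

    actions[s]: iterable of actions available at state s.
    p[t][(s, a)], g[t][(s, a)]: dicts {outcome: probability}.
    """
    @lru_cache(maxsize=None)
    def win(t, s, w):
        if t == T:
            return w == k
        # The decision maker needs one action that wins against every
        # positive-probability (reward, next state) pair the adversary may pick.
        return any(
            all(
                win(t + 1, s2, w + r)
                for r, pr in g[t][(s, a)].items() if pr > 0
                for s2, ps in p[t][(s, a)].items() if ps > 0
            )
            for a in actions[s]
        )

    return win(0, s0, 0)
```

Each augmented state is evaluated at most once, so the running time is polynomial in $T$, the number of augmented states, and the number of actions. Looping over all integers $k$ with $|k| \le KT$ then decides whether a zero-variance policy exists.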

VI. Approximation Algorithms 

In this section, we deal with the optimization counterparts of the problem MV-MDP($\Pi_{h,u}$). We are
interested in computing approximately the following two functions:

$$v^*(\lambda) = \inf_{\{\pi \in \Pi_{h,u} \,:\, J_\pi \ge \lambda\}} V_\pi, \qquad (3)$$

and

$$\lambda^*(v) = \sup_{\{\pi \in \Pi_{h,u} \,:\, V_\pi \le v\}} J_\pi. \qquad (4)$$

If the constraint $J_\pi \ge \lambda$ (respectively, $V_\pi \le v$) is infeasible, we use the standard convention $v^*(\lambda) = \infty$
(respectively, $\lambda^*(v) = -\infty$). Note that the infimum and supremum in the above definitions are both
attained, because the set $P_{MV}$ of achievable mean-variance pairs is the image of the polyhedron $P_{MQ}$
under a continuous map, and is therefore compact.

We do not know how to efficiently compute or even generate a uniform approximation of either $v^*(\lambda)$
or $\lambda^*(v)$ (i.e., find a value $v'$ between $v^*(\lambda) - \epsilon$ and $v^*(\lambda) + \epsilon$, and similarly for $\lambda^*(v)$). In the following
two results we consider a weaker notion of approximation that is computable in pseudopolynomial time.
We discuss $v^*(\lambda)$, as the issues for $\lambda^*(v)$ are similar.

For any positive $\epsilon$ and $\nu$, we will say that $\hat v(\cdot)$ is an $(\epsilon, \nu)$-approximation of $v^*(\cdot)$ if, for every $\lambda$,

$$v^*(\lambda - \nu) - \epsilon \le \hat v(\lambda) \le v^*(\lambda + \nu) + \epsilon. \qquad (5)$$



This is an approximation of the same kind as those considered in Papadimitriou and Yannakakis (2000):
it returns a value $\hat v$ such that $(\lambda, \hat v)$ is an element of the "$(\epsilon + \nu)$-approximate Pareto boundary" of the
set $P_{MV}$. For a different view, the graph of the function $\hat v(\cdot)$ is within Hausdorff distance $\epsilon + \nu$ from the
graph of the function $v^*(\cdot)$.
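
To make the definition in (5) concrete, the following sketch checks the two-sided inequality on a finite set of test points. The helper `satisfies_eps_nu` and the toy curve `v_star` are assumptions introduced only for illustration; the toy curve corresponds to a problem in which zero variance is achievable for any mean up to 1 and nothing is feasible above 1.

```python
import math

def satisfies_eps_nu(v_hat, v_star, eps, nu, lambdas):
    """Check inequality (5) at each point of `lambdas` (a finite test grid)."""
    return all(
        v_star(lam - nu) - eps <= v_hat(lam) <= v_star(lam + nu) + eps
        for lam in lambdas
    )

# Assumed toy tradeoff curve: variance 0 for means up to 1, infeasible above.
def v_star(lam):
    return 0.0 if lam <= 1 else math.inf

# A candidate that is the true curve shifted horizontally by 0.1.
def shifted(lam):
    return v_star(lam + 0.1)

grid = [0.0, 0.5, 0.95, 1.5]
print(satisfies_eps_nu(shifted, v_star, eps=0.2, nu=0.1, lambdas=grid))
print(satisfies_eps_nu(shifted, v_star, eps=0.2, nu=0.0, lambdas=grid))
```

The shifted curve is accepted when a horizontal slack of 0.1 is allowed, but rejected as a uniform approximation (slack 0), because near the feasibility threshold the true curve jumps from 0 to infinity. This is the sense in which the notion is weaker than uniform approximation.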

We will show how to compute an $(\epsilon, \nu)$-approximation in time which is pseudopolynomial, and
polynomial in the parameters $1/\epsilon$ and $1/\nu$.

We start in Section VI-A with the case of integer rewards, and build on the pseudopolynomial time
algorithms of the preceding section. We then consider the case of general rewards in Section VI-B. We
finally sketch an alternative algorithm in Section VI-C, based on set-valued dynamic programming.




A. Integer Rewards 

In this section, we prove the following result. 

Theorem 9. Suppose that the immediate rewards are integers. There exists an algorithm that, given $\epsilon$, $\nu$,
and $\lambda$, outputs a value $\hat v(\lambda)$ that satisfies (5), and which runs in time polynomial in $|S|$, $|A|$, $T$, $K$, $1/\epsilon$,
and $1/\nu$.

Proof. Since the rewards are bounded in absolute value by $K$, we have $v^*(\lambda) = \infty$ for $\lambda > KT$ and
$v^*(\lambda) = v^*(-KT)$ for $\lambda < -KT$. For this reason, we only need to consider $\lambda \in [-KT, KT]$. To simplify
the presentation, we assume that $\epsilon = \nu$. We let $\delta$ be such that $\epsilon = 3\delta KT$.

The algorithm is as follows. We consider grid points $\lambda_i$ defined by $\lambda_i = -KT + (i - 1)\delta$, $i = 1, \ldots, n$,
where $n$ is chosen so that $\lambda_{n-1} \le KT$ and $\lambda_n > KT$. Note that $n = O(KT/\delta)$. For $i = 1, \ldots, n - 1$,
we calculate $q(\lambda_i)$, the smallest possible value of $E[W_T^2]$ when $E[W_T]$ is restricted to lie in $[\lambda_i, \lambda_{i+1}]$.
Formally,

$$q(\lambda_i) = \min\{\, q \mid \exists\, \lambda' \in [\lambda_i, \lambda_{i+1}] \text{ s.t. } (\lambda', q) \in P_{MQ} \,\}.$$

We let $u(\lambda_i) = q(\lambda_i) - \lambda_{i+1}^2$, which can be interpreted as an estimate of the least possible variance when
$E[W_T]$ is restricted to the interval $[\lambda_i, \lambda_{i+1}]$. Finally, we set

$$\hat v(\lambda) = \min_{i \ge k} u(\lambda_i), \qquad \text{if } \lambda \in [\lambda_k, \lambda_{k+1}].$$

The main computational effort is in computing $q(\lambda_i)$ for every $i$. Since $P_{MQ}$ is a polyhedron, this
amounts to solving $O(KT/\delta)$ linear programming problems. Thus, the running time of the algorithm has
the claimed properties.
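
The grid construction can be sketched as below. This is an illustrative sketch under a simplifying assumption: instead of solving linear programs, it takes the polyhedron of achievable (mean, second moment) pairs as a finite list of points whose convex hull is the polyhedron, and minimizes $q$ over each strip by inspecting points and segment crossings. The names `min_q_on_strip` and `build_v_hat` are hypothetical, and exact rational arithmetic is used to avoid rounding on the grid.

```python
from fractions import Fraction
from itertools import combinations
import math

def min_q_on_strip(points, lo, hi):
    """Smallest q over conv(points) with lo <= lam <= hi (None if infeasible)."""
    cands = [q for lam, q in points if lo <= lam <= hi]
    for (l1, q1), (l2, q2) in combinations(points, 2):
        if l1 == l2:
            continue  # vertical segment: its endpoints are handled above
        for x in (lo, hi):
            if min(l1, l2) <= x <= max(l1, l2):
                t = (x - l1) / (l2 - l1)
                cands.append(q1 + t * (q2 - q1))
    return min(cands) if cands else None

def build_v_hat(points, K, T, eps):
    """Return the estimate v_hat(.) built from the grid described in the text."""
    pts = [(Fraction(l), Fraction(q)) for l, q in points]
    KT = Fraction(K) * T
    delta = Fraction(eps) / (3 * KT)             # eps = 3 * delta * K * T
    n = math.floor(2 * KT / delta) + 2           # lam_{n-1} <= KT < lam_n
    grid = [-KT + j * delta for j in range(n)]   # grid[j] holds lam_{j+1}
    u = []
    for j in range(n - 1):
        q = min_q_on_strip(pts, grid[j], grid[j + 1])
        u.append(math.inf if q is None else q - grid[j + 1] ** 2)
    suffix = u[:]                                # suffix[k] = min of u[k:]
    for j in range(n - 3, -1, -1):
        suffix[j] = min(suffix[j], suffix[j + 1])

    def v_hat(lam):
        lam = Fraction(lam)
        if lam > KT:
            return math.inf                      # constraint is infeasible
        k = max(0, min(n - 2, math.floor((lam + KT) / delta)))
        return suffix[k]
    return v_hat

# Single-stage toy problem (assumed data): two actions with rewards 0 and 1,
# so the achievable moment pairs form the segment from (0, 0) to (1, 1).
v_hat = build_v_hat([(0, 0), (1, 1)], K=1, T=1, eps=Fraction(3, 10))
print(v_hat(Fraction(1, 2)))
```

In this toy problem the true value is $v^*(\lambda) = 0$ for every $\lambda \le 1$, and the returned estimate is slightly negative but within $\epsilon = 3/10$ of it, in line with the underestimate property proved next.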

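We now prove correctness. Let $q^*(\lambda) = \min\{q \mid (\lambda, q) \in P_{MQ}\}$, and $u^*(\lambda) = q^*(\lambda) - \lambda^2$, which is the
least possible variance for a given value of $\lambda$. Note that $v^*(\lambda) = \min\{u^*(\lambda') \mid \lambda' \ge \lambda\}$.

We have $q(\lambda_i) \le q^*(\lambda')$, for all $\lambda' \in [\lambda_i, \lambda_{i+1}]$. Also, $-\lambda_{i+1}^2 \le -(\lambda')^2$, for all $\lambda' \in [\lambda_i, \lambda_{i+1}]$. By adding
these two inequalities, we obtain $u(\lambda_i) \le u^*(\lambda')$, for all $\lambda' \in [\lambda_i, \lambda_{i+1}]$. Given some $\lambda$, let $k$ be such that
$\lambda \in [\lambda_k, \lambda_{k+1}]$. Then,

$$\hat v(\lambda) = \min_{i \ge k} u(\lambda_i) \le \min_{\lambda' \ge \lambda_k} u^*(\lambda') \le \min_{\lambda' \ge \lambda} u^*(\lambda') = v^*(\lambda),$$

so that $\hat v(\lambda)$ is always an underestimate of $v^*(\lambda)$.

We now prove a reverse inequality. Fix some $\lambda$ and let $k$ be such that $\lambda \in [\lambda_k, \lambda_{k+1}]$. Let $i \ge k$ be
such that $\hat v(\lambda) = u(\lambda_i)$. Let also $\bar\lambda \in [\lambda_i, \lambda_{i+1}]$ be such that $q^*(\bar\lambda) = q(\lambda_i)$. Note that

$$\lambda_{i+1}^2 - \bar\lambda^2 \le \lambda_{i+1}^2 - \lambda_i^2 = \delta(\lambda_i + \lambda_{i+1}) \le 2\delta(KT + \delta) \le 3\delta KT. \qquad (6)$$

Then,

$$\hat v(\lambda) \overset{(a)}{=} u(\lambda_i) \overset{(b)}{=} q(\lambda_i) - \lambda_{i+1}^2 \overset{(c)}{=} q^*(\bar\lambda) - \lambda_{i+1}^2 \overset{(d)}{\ge} q^*(\bar\lambda) - \bar\lambda^2 - 3\delta KT$$
$$\overset{(e)}{=} u^*(\bar\lambda) - 3\delta KT \overset{(f)}{\ge} v^*(\bar\lambda) - 3\delta KT \overset{(g)}{\ge} v^*(\lambda - \delta) - 3\delta KT \overset{(h)}{\ge} v^*(\lambda - \epsilon) - \epsilon.$$

In the above, (a) holds by the definition of $i$; (b) by the definition of $u(\lambda_i)$; (c) by the definition of $\bar\lambda$;
and (d) follows from Eq. (6). Equality (e) follows from the definition of $u^*(\cdot)$. Inequality (f) follows from
the definition of $v^*(\cdot)$; and (g) is obtained because $v^*(\cdot)$ is nondecreasing and because $\bar\lambda \ge \lambda - \delta$. (The
latter fact is seen as follows: (i) if $i > k$, then $\lambda \le \lambda_{k+1} \le \lambda_i \le \bar\lambda$; (ii) if $i = k$, then both $\lambda$ and $\bar\lambda$
belong to $[\lambda_k, \lambda_{k+1}]$, and their difference is at most $\delta$.) Inequality (h) is obtained because of the definition
$\epsilon = 3\delta KT$, the observation $\delta \le \epsilon$, and the monotonicity of $v^*(\cdot)$. q.e.d.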

B. General Rewards 

When rewards are arbitrary, we can discretize the rewards and obtain a new MDP. The new MDP is
equivalent to one with integer rewards to which the algorithm of the preceding subsection can be applied.
This is a legitimate approximation algorithm for the original problem because, as we will show shortly,
the function $u^*(\cdot)$ changes very little when we discretize using a fine enough discretization.

We are given an original MDP $\mathcal{M} = (T, S, A, \mathcal{R}, p, g)$ in which the rewards are rational numbers in
the interval $[-K, K]$, and an approximation parameter $\epsilon$. We fix a positive number $\delta$, a discretization
parameter whose value will be specified later. We then construct a new MDP $\mathcal{M}' = (T, S, A, \mathcal{R}', p, g')$,
in which the rewards are rounded down to an integer multiple of $\delta$. More precisely, all elements of the
reward range $\mathcal{R}'$ are integer multiples of $\delta$, and for every $(t, s, a) \in \{0, 1, \ldots, T - 1\} \times S \times A$, and any
integer $n$, we have

$$g'_t(\delta n \mid s, a) = \sum_{r \,:\, \delta n \le r < \delta(n+1)} g_t(r \mid s, a).$$

We denote by $J$, $Q$ and by $J'$, $Q'$ the first and second moments of the total reward in the original and
new MDPs, respectively. Let $\Pi_{h,u}$ and $\Pi'_{h,u}$ be the sets of (randomized, history-based) policies in $\mathcal{M}$ and
$\mathcal{M}'$, respectively. Let $P_{MQ}$ and $P'_{MQ}$ be the associated polyhedra.
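
The rounding-down rule for the reward distribution can be illustrated with a short sketch. The function name `discretize_reward_dist` and the dictionary representation of a single distribution $g_t(\cdot \mid s, a)$ are assumptions made for the example; exact rationals are used because the rewards are assumed rational.

```python
from collections import defaultdict
from fractions import Fraction
import math

def discretize_reward_dist(dist, delta):
    """Round each reward down to an integer multiple of delta.

    dist maps reward -> probability for one fixed (t, s, a). The result maps
    delta * n to the total probability of rewards in [delta*n, delta*(n+1)).
    """
    out = defaultdict(Fraction)
    for r, prob in dist.items():
        n = math.floor(Fraction(r) / delta)   # largest n with delta * n <= r
        out[n * delta] += prob
    return dict(out)

# Assumed example distribution over three rational rewards.
dist = {
    Fraction(1, 3): Fraction(1, 2),
    Fraction(2, 5): Fraction(1, 4),
    Fraction(-1, 3): Fraction(1, 4),
}
rounded = discretize_reward_dist(dist, Fraction(1, 4))
print(rounded)
```

Dividing every rounded reward by $\delta$ then yields an MDP with integer rewards, to which the integer-reward algorithm applies.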

We want to argue that the mean-variance tradeoff curves for the two MDPs are close to each other.
This is not entirely straightforward because the augmented state spaces (which include the possible values
of the cumulative rewards $W_t$) are different for the two problems and, therefore, the sets of policies are
also different. A conceptually simple but somewhat tedious approach involves an argument along the
lines of Whitt (1978, 1979), generalized to the case of constrained MDPs; we outline such an argument
in Section VI-C. Here, we follow an alternative approach, based on a coupling argument.

Proposition 1. There exists a polynomial function $c(K, T)$, namely $c(K, T) = 2KT^2$, such that the
Hausdorff distance between $P_{MQ}$ and $P'_{MQ}$ is bounded above by $c(K, T)\,\delta$. More precisely,

(a) For every policy $\pi \in \Pi_{h,u}$, there exists a policy $\pi' \in \Pi'_{h,u}$ such that

$$\max\{\, |J'_{\pi'} - J_\pi|,\ |Q'_{\pi'} - Q_\pi| \,\} \le 2KT^2\delta.$$

(b) Conversely, for every policy $\pi' \in \Pi'_{h,u}$, there exists a policy $\pi \in \Pi_{h,u}$ such that the above inequality again
holds.

Proof. We denote by $d(r)$ the discretized value of a reward $r$, that is, $d(r) = \max\{n\delta : n\delta \le r,\ n \in \mathbb{Z}\}$.
Let us consider a third MDP $\mathcal{M}''$ which is identical to $\mathcal{M}'$, except that its rewards $R''_t$ are generated as
follows. (We follow the convention of using a single or double prime to indicate variables associated with
$\mathcal{M}'$ or $\mathcal{M}''$, respectively.) A random variable $R_t$ is generated according to the distribution prescribed by
$g_t(r \mid S_t, a_t)$, and its value is observed by the decision maker, who then incurs the reward $R''_t = d(R_t)$. Let
$P''_{MQ}$ be the polyhedron associated with $\mathcal{M}''$. We claim that $P''_{MQ} = P'_{MQ}$. The only difference between
$\mathcal{M}'$ and $\mathcal{M}''$ is that the decision maker in $\mathcal{M}''$ has access to the additional information $R_t - d(R_t)$.
However, this information is inconsequential: it does not affect the future transition probabilities or reward
distributions. Thus, $R_t - d(R_t)$ can only be useful as an additional randomization variable. Since $P'_{MQ}$ is the
set of achievable pairs using general (history-based randomized) policies, having available an additional
randomization variable does not change the polyhedron, and $P''_{MQ} = P'_{MQ}$. Thus, to complete the proof
it suffices to show that the polyhedra $P_{MQ}$ and $P''_{MQ}$ are close.

Let us compare the MDPs $\mathcal{M}$ and $\mathcal{M}''$. The information available to the decision maker is the same
for these two MDPs (since all the history of reward truncations $\{R_\tau - d(R_\tau)\}_{\tau=0}^{t-1}$ is available in $\mathcal{M}''$
for the decision at time $t$). Therefore, for every policy in one MDP, there exists a policy for the other
under which (if we define the two MDPs on a common probability space, involving common random
generators) the exact same sequence of states ($S_t = S''_t$), actions ($A_t = A''_t$), and random variables $R_t$ is
realized. The only difference is that the rewards are $R_t$ and $d(R_t)$, in $\mathcal{M}$ and $\mathcal{M}''$, respectively. Recall
that $0 \le R_t - d(R_t) \le \delta$. We obtain that for every policy $\pi \in \Pi$, there exists a policy $\pi'' \in \Pi''$ for which
$0 \le W_T - W''_T = \sum_{t=0}^{T-1} (R_t - d(R_t)) \le \delta T$, and therefore, $|W_T^2 - (W''_T)^2| \le 2KT^2\delta$. Taking expectations,
we obtain $|J_\pi - J''_{\pi''}| \le T\delta$ and $|Q_\pi - Q''_{\pi''}| \le 2KT^2\delta$. This completes the proof of part (a). The proof of part
(b) is identical. q.e.d.
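
The bound on the difference of squares in the proof follows from a one-line factorization. The sketch below spells it out, under the assumption (not stated explicitly in the text) that the rounded cumulative reward also satisfies $|W''_T| \le KT$, which holds for instance when $K$ is an integer multiple of $\delta$:

```latex
\big| W_T^2 - (W''_T)^2 \big|
  = \big| W_T - W''_T \big| \cdot \big| W_T + W''_T \big|
  \le \delta T \cdot \big( |W_T| + |W''_T| \big)
  \le \delta T \cdot 2KT
  = 2KT^2\delta .
```

The first factor is controlled by the per-stage rounding error accumulated over $T$ stages, and the second by the boundedness of the rewards.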

Theorem 10. There exists an algorithm that, given $\epsilon$, $\nu$, and $\lambda$, outputs a value $\hat v(\lambda)$ that satisfies (5),
and which runs in time polynomial in $|S|$, $|A|$, $T$, $K$, $1/\epsilon$, and $1/\nu$.

Proof. Assume for simplicity that $\nu = \epsilon$. Given the value of $\epsilon$, let $\delta$ be such that $\epsilon/2 = 2KT^2\delta$, and
construct the discretized MDP $\mathcal{M}'$. Run the algorithm from Theorem 9 to find an $(\epsilon/2, \epsilon/2)$-approximation
$\hat v$ for $\mathcal{M}'$. Using Proposition 1, it is not hard to verify that this yields an $(\epsilon, \epsilon)$-approximation of $v^*(\lambda)$.
q.e.d.

C. An Exact Algorithm and its Approximation 

There are two general approaches for constructing approximation algorithms. (i) One can discretize
the problem, to obtain an easier one, and then apply an algorithm specially tailored to the discretized
problem; this was the approach in the preceding subsection. (ii) One can design an exact (but inefficient)
algorithm for the original problem and then implement the algorithm approximately. This approach will
work provided the approximation errors do not build up excessively in the course of the algorithm. In this
subsection, we elaborate on the latter approach.

We defined earlier the polyhedron $P_{MQ}$ as the set of achievable first and second moments of the
cumulative reward starting at time zero at the initial state. We extend this definition by considering
intermediate times and arbitrary (intermediate) augmented states. We let

$$C_t(s, w) = \big\{ (\lambda, q) : \exists\, \pi \in \Pi_{h,u} \text{ s.t. } E_\pi[W_T \mid S_t = s, W_t = w] = \lambda \text{ and } E_\pi[W_T^2 \mid S_t = s, W_t = w] = q \big\}. \qquad (7)$$

Clearly, $C_0(s, 0) = P_{MQ}$. Using a straightforward backwards induction, it can be shown that $C_t(\cdot, \cdot)$
satisfies the set-valued dynamic programming recursion (see footnote 2)

$$C_t(s, w) = \mathrm{conv}_{a \in A} \Big\{ \sum_{s' \in S} p_t(s' \mid s, a) \sum_{r \in \mathcal{R}} g_t(r \mid s, a)\, C_{t+1}(s', w + r) \Big\}, \qquad (8)$$

for every $s \in S$, $w \in \mathbb{R}$, and for $t = 0, 1, 2, \ldots, T - 1$, initialized with the boundary conditions

$$C_T(s, w) = \{(w, w^2)\}. \qquad (9)$$

A simple inductive proof shows that the sets $C_t(s, w)$ are polyhedra; this is because $C_T(s, w)$ is either
empty or a singleton and because the sum or convex hull of finitely many polyhedra is a polyhedron. Thus,
the recursion involves a finite amount of computation, e.g., by representing each polyhedron in terms of its
finitely many extreme points. In the worst case, this translates into an exponential time algorithm, because
of the possibly large number of extreme points. However, such an algorithm can also be implemented
approximately. If we allow for the introduction of an $O(\epsilon/T)$ error at each stage (where error is measured
in terms of the Hausdorff distance), we can work with approximating polyhedra that involve only $O(1/\epsilon)$
extreme points, while ending up with an $O(\epsilon)$ total error; this is because we are approximating polyhedra in
the plane, as opposed to higher dimensions, where the dependence on $\epsilon$ would have been worse.
The details are straightforward but somewhat tedious and are omitted. On the other hand, in practice, this
approach is likely to be faster than the algorithm of the preceding subsection.
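
A minimal sketch of the exact recursion (8)-(9) for small problems is given below. It is an illustration rather than the authors' algorithm: each polygon is stored as the tuple of its extreme points, the model is assumed time-homogeneous, and the names `hull` and `moment_sets` as well as the dictionary layout of `p` and `g` are hypothetical. No approximation of the polygons is performed, so the number of extreme points can grow quickly.

```python
from fractions import Fraction
from functools import lru_cache

def hull(points):
    """Extreme points of the planar convex hull (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return tuple(pts)
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    def half(seq):
        h = []
        for pt in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], pt) <= 0:
                h.pop()
            h.append(pt)
        return h
    lower, upper = half(pts), half(reversed(pts))
    return tuple(lower[:-1] + upper[:-1])

def moment_sets(T, actions, p, g):
    """Return C(t, s, w): extreme points of the achievable (mean, 2nd moment) set.

    p[(s, a)] maps next_state -> probability; g[(s, a)] maps reward -> probability.
    """
    @lru_cache(maxsize=None)
    def C(t, s, w):
        if t == T:
            return ((Fraction(w), Fraction(w) ** 2),)      # boundary condition (9)
        union = []
        for a in actions:
            total = [(Fraction(0), Fraction(0))]
            for s2, ps in p[(s, a)].items():
                for r, pr in g[(s, a)].items():
                    c = ps * pr
                    # Weighted Minkowski sum; extreme points are among pairwise sums.
                    total = hull([(x + c * lam, y + c * q)
                                  for x, y in total
                                  for lam, q in C(t + 1, s2, w + r)])
            union.extend(total)
        return hull(union)                                  # convex hull over actions
    return C

# Two-stage toy problem (assumed data): one state, rewards 0 or 1 per action.
p = {("s", 0): {"s": Fraction(1)}, ("s", 1): {"s": Fraction(1)}}
g = {("s", 0): {0: Fraction(1)}, ("s", 1): {1: Fraction(1)}}
C = moment_sets(2, (0, 1), p, g)
print(C(0, "s", 0))
```

In the toy problem the extreme points correspond to the deterministic policies, whose total rewards are 0, 1, or 2; randomized policies fill in the rest of the triangle.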

VII. Conclusions 

We have shown that mean-variance optimization problems for MDPs are typically NP-hard, but sometimes
admit pseudopolynomial approximation algorithms. We only considered finite horizon problems,
but it is clear that the negative results carry over to their infinite horizon counterparts. Furthermore, given
that the contribution of the tail of the time horizon in infinite horizon discounted problems (or in "proper"
stochastic shortest path problems as in Bertsekas (1995)) can be made arbitrarily small, our approximation
algorithms can also yield approximation algorithms for infinite horizon problems.

Footnote 2: If $X$ and $Y$ are subsets of a vector space and $\alpha$ is a scalar, we let $\alpha X = \{\alpha x \mid x \in X\}$ and
$X + Y = \{x + y \mid x \in X,\ y \in Y\}$. Furthermore, if for every $a \in A$ we have a set $X_a$, then
$\mathrm{conv}_{a \in A}\{X_a\}$ is the convex hull of the union of these sets.

Two more problems of some interest deal with finding a policy that has the smallest possible, or the
largest possible variance. There is not much we can say here, except for the following:

(a) The smallest possible variance is attained by a deterministic policy, that is,

$$\min_{\pi \in \Pi_{h,u}} V_\pi = \min_{\pi \in \Pi_h} V_\pi.$$

This is proved using the inequality $\mathrm{Var}_\pi(W_T) \ge E_\pi[\mathrm{Var}_\pi(W_T \mid U_{0:T})]$.

(b) Variance will be maximized, in general, by a randomized policy. To see this, consider a single stage
problem and two actions with deterministic rewards, equal to 0 and 1, respectively. Variance is
maximized by assigning probability 1/2 to each of the actions. The variance maximization problem
is equivalent to maximizing the concave function $q - \lambda^2$ subject to $(\lambda, q) \in P_{MQ}$. This is a quadratic
programming problem over the polyhedron $P_{MQ}$ and therefore admits a pseudopolynomial time
algorithm, when the rewards are integer.
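
The single-stage example in item (b) can be checked by a short calculation. Writing $\rho$ for the probability of choosing the action with reward 1 (the symbol $\rho$ is introduced here only for this illustration), the achievable moment pair is $(\lambda, q) = (\rho, \rho)$, so that:

```latex
V = q - \lambda^2 = \rho - \rho^2 = \rho(1 - \rho),
\qquad
\frac{dV}{d\rho} = 1 - 2\rho = 0 \iff \rho = \tfrac{1}{2},
\qquad
V\big|_{\rho = 1/2} = \tfrac{1}{4}.
```

Both deterministic choices ($\rho = 0$ and $\rho = 1$) give zero variance, so the maximum is attained only by a randomized policy.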

Our results suggest several interesting directions for future research, which we briefly outline below. 

First, our negative results apply to general MDPs. It would be interesting to determine whether the
hardness results remain valid for specially structured MDPs. One possibly interesting special case involves
multi-armed bandit problems: there are $n$ separate MDPs ("arms"); at each time step, the decision maker
has to decide which MDP to activate, while the other MDPs remain inactive. Of particular interest here
are index policies that compute a value ("index") for each MDP and select an MDP with maximal
index; such policies are often optimal for the classical formulations (see Gittins (1979) and Whittle (1988)).
Obtaining a policy that uses some sort of an index for the mean-variance problem, or alternatively proving
that such a policy cannot exist, would be interesting.

Second, a number of complexity questions have been left open. We list a few of them: 

(a) Is there a pseudopolynomial time algorithm for computing $v^*(\lambda)$ or $\lambda^*(v)$ exactly?

(b) Is there a polynomial or pseudopolynomial time algorithm that computes $v^*(\lambda)$ or $\lambda^*(v)$ within a
uniform error bound $\epsilon$?

(c) Is the problem of computing $\hat v(\lambda)$ with the properties in Eq. (5) NP-hard?

(d) Is there a pseudopolynomial time algorithm for computing the smallest possible variance in the absence of any
constraints on the mean cumulative reward?

Third, bias-variance tradeoffs may play an important role in speeding up certain control and learning
heuristics, such as those involving control variates (Meyn, 2008). Perhaps mean-variance optimization
can be used to address the exploration/exploitation tradeoff in model-based reinforcement learning, with
variance reduction serving as a means to reduce the exploration time (see Sutton and Barto (1998) for
a general discussion of exploration-exploitation in reinforcement learning). Of course, in light of the
computational complexity of bias-variance tradeoffs, incorporating bias-variance tradeoffs in learning
makes sense only if experimentation is nearly prohibitive and computation time is cheap. Such an approach
could be particularly useful if a coarse, low-complexity, approximate solution of a bias-variance tradeoff
problem can result in significant exploration speedup.

Fourth, we only considered mean-variance tradeoffs in this paper. However, there are other interesting
and potentially useful criteria that can be used to incorporate risk into multi-stage decision making. For
example, Liu and Koenig (2005) consider a utility function with a single switch. Many other risk-aware
criteria have been considered in the single stage case. It would be interesting to develop a comprehensive
theory for the complexity of solving multi-stage decision problems under general (monotone convex or
concave) utility functions and under risk constraints. This is especially interesting for the approximation
algorithms presented in Section VI.